Learning Guidance Rewards with Trajectory-space Smoothing


A Appendix: Learning Guidance Rewards with Trajectory-space Smoothing

A.1 Monte-Carlo Estimate of the Guidance Rewards

Neural Information Processing Systems

Let $Z^{\pi}(s,a)$ be the random variable denoting the sum of discounted rewards along a trajectory starting with the state-action pair $(s,a)$.
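As an illustration of this definition, the following minimal sketch computes one sample of the discounted reward sum for a single finite trajectory. The function name `discounted_return`, the discount symbol `gamma`, and the reward values are assumptions made for the example; they are not taken from the paper.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * rewards[t] for one finite trajectory."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Accumulate backwards so each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical rewards observed after starting from some pair (s, a).
sample = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
print(sample)  # 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
```

Each call yields one realization of the random variable; averaging such samples over many trajectories that start from the same state-action pair would give a Monte-Carlo estimate of its expectation.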